An Evaluation Study of Machine Learning Techniques for Identifying Spam
Abstract
In this work, we investigate two kinds of machine learning techniques, decision trees and Naive Bayes, applied to the problem of spam classification. We first build a single decision tree for this purpose and then build an ensemble of decision trees using boosting. A single decision tree gives fairly good classification accuracy of around 92%, and the ensemble raises this to around 95%. However, overfitting to the training data is observed in both cases. Our experiments with the Naive Bayes classifier show that it yields a very high classification accuracy, on the order of 99%. Its performance improved further when we used a simpler binomial model assumption and incorporated one of the modifications to Naive Bayes suggested in (Rennie et al. 2003).
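As an illustration only, the following is a minimal sketch of a Naive Bayes spam classifier over binary word-presence features. It assumes that the "simpler binomial model assumption" refers to modeling each word as present or absent in a message (often called the Bernoulli event model); the abstract does not confirm this reading. The function names, the smoothing parameter `alpha`, and the toy messages are hypothetical and are not taken from the study, and the modification from (Rennie et al. 2003) is not shown because the abstract does not say which one was used.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Fit a word-presence (Bernoulli) Naive Bayes model.

    docs   : list of token lists (hypothetical toy input format)
    labels : list of class labels, one per document
    alpha  : Laplace smoothing constant (an assumption, not from the study)
    """
    vocab = sorted({w for tokens in docs for w in tokens})
    classes = sorted(set(labels))
    n_docs = {c: 0 for c in classes}
    doc_freq = {c: defaultdict(int) for c in classes}

    for tokens, y in zip(docs, labels):
        n_docs[y] += 1
        for w in set(tokens):  # count presence once per document
            doc_freq[y][w] += 1

    total = len(docs)
    log_prior = {c: math.log(n_docs[c] / total) for c in classes}
    # Smoothed estimate of P(word present | class)
    p_word = {
        c: {w: (doc_freq[c][w] + alpha) / (n_docs[c] + 2 * alpha) for w in vocab}
        for c in classes
    }
    return vocab, classes, log_prior, p_word


def predict(model, tokens):
    """Return the class with the highest log-posterior score."""
    vocab, classes, log_prior, p_word = model
    present = set(tokens)
    best_class, best_score = None, -math.inf
    for c in classes:
        score = log_prior[c]
        for w in vocab:
            p = p_word[c][w]
            # Absent words contribute too, unlike in the multinomial model
            score += math.log(p) if w in present else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy data, invented purely for illustration
train_docs = [
    ["win", "free", "prize", "now"],
    ["free", "cash", "offer"],
    ["meeting", "schedule", "tomorrow"],
    ["project", "report", "attached"],
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_bernoulli_nb(train_docs, train_labels)
print(predict(model, ["free", "prize"]))
print(predict(model, ["project", "meeting"]))
```

In this sketch every vocabulary word contributes to the score whether or not it appears in the message, which is the main practical difference from a word-count (multinomial) model.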
Similar resources
A Classification Method for E-mail Spam Using a Hybrid Approach for Feature Selection Optimization
Spam is unwanted email that harms communication around the world. It is a growing problem for personal email, so detecting it is essential. Machine learning is well suited to this problem because its adaptive nature lets it learn the patterns required for classification. Nonetheless, in spam detection, there are a large num...
An Effective Model for SMS Spam Detection Using Content-based Features and Averaged Neural Network
In recent years, there has been considerable interest in short message service (SMS) as one of the essential and straightforward communication services on mobile devices. The increased popularity of this service has also increased the number of attacks on mobile devices, such as SMS spam messages. SMS spam messages constitute a real problem for mobile subscribers; this worries telecomm...
Machine learning algorithms in air quality modeling
Modern studies in environmental science and engineering show that deterministic models struggle to capture the relationship between the concentration of atmospheric pollutants and their emission sources. Recent advances in statistical modeling based on machine learning approaches have emerged as a solution to these issues. It is a fact that, input variable type largely affec...
A comparative study for content-based dynamic spam classification using four machine learning algorithms
The growth in the number of email users has resulted in a dramatic increase in spam emails during the past few years. In this paper, four machine learning algorithms, namely Naïve Bayesian (NB), neural network (NN), support vector machine (SVM) and relevance vector machine (RVM), are proposed for spam classification. An empirical evaluation of them on the benchmark spam filtering corpora is prese...
A Comparison of Ensemble and Case-Base Maintenance Techniques for Handling Concept Drift in Spam Filtering
The problem of concept drift has recently received considerable attention in machine learning research. One important practical problem where concept drift needs to be addressed is spam filtering. The literature on concept drift shows that ensembles are among the most promising approaches, and a variety of techniques for ensemble construction have been proposed. In this paper we consider an alter...